Improving Predicted Accuracy of Breast Cancer Diagnosis with Machine Learning Models

Introduction

For this project, I will be using the Universal Workflow introduced in Deep Learning With Python 4.5. While I am aware that this book mainly covers deep learning techniques focusing on neural networks, I believe this workflow to be extendible to traditional machine learning algorithms as well.

Defining the problem: Clearly define the problem, and understand the task at hand, the available data, and the outcome desired.

Preparing the data: Transform raw data into a form that is appropriate for use in a deep learning model. This may include data cleaning, normalization, encoding, splitting into training/validation/test sets, etc.

The following steps will then be repeated for each machine learning model that will be explored in this project.

Defining the model: Choose an appropriate architecture for the problem, including the number of layers, the types of layers, the activation functions, etc.

Compiling the model: For neural networks, specify the optimizer, the loss function, and the metrics that will be used to evaluate the model during training.

Training the model: Train the model on the training data set.

Evaluating the model: Evaluate the performance of the model on the test set to estimate its real-world performance.

Tuning the model: If the performance is not satisfactory, adjust the hyperparameters, and repeat the previous steps until a satisfactory model is obtained.

Using the model: Deploy the trained model on new data to make predictions or classifications.

To this end, I will be using the Breast Cancer Wisconsin (Diagnostic) dataset from Kaggle, which consists of data computed from digitised images of tumour cells extracted by Fine Needle Aspiration.

Problem Statement

The goal is to explore the efficacy of different machine learning models and deep neural networks in determining whether the tumour represented in the images is benign or malignant. This problem is a binary classification problem, as there are only two classes that the tumours can be: benign and malignant. We will be using the Breast Cancer Wisconsin (Diagnostic) Data Set found on Kaggle for this project, which contains data computed from digitised images of Fine Needle Aspirates (FNAs) of breast masses, which describe characteristics of cell nuclei in the masses.

Measures of success

The main measures of success by which we will be judging our model will be recall, followed by accuracy. This is because, for this problem, the cost of false negatives is higher than the cost of false positives, as a misdiagnosis of a tumour as benign may result in delays in treatment, leading to the cancer progressing, and becoming harder to treat. However, while that is the case, the costs of false positives are not negligible either, and cannot be ignored, as it can lead to costly treatments, additional tests, and emotional stress inflicted upon the patients. Therefore, recall, which reflects the percentage of false negatives, will be our first priority as a metric for success.

In [1]:
# General Use Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Machine Learning Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Neural Network Libraries
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D,Flatten,Dense,Dropout
In [2]:
# Number that will be used for random seeds, to ensure replicable results
state = 73

from numpy.random import seed
from tensorflow.keras.utils import set_random_seed
seed(state)
set_random_seed(state)
In [3]:
# Loading the dataset
df =  pd.read_csv('data.csv')
df.head()
Out[3]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

Exploratory Data Analysis

In [4]:
# Get brief overview of datatypes, as well as check for any null entries
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

As we can see, the dataset was imported into the notebook with an empty column, likely due to a missing comma in the CSV file. Therefore, the first step to take is to drop that column.

In [5]:
# Drop empty column
df = df.drop(['Unnamed: 32'], axis = 1)
In [6]:
# Check new shape of dataframe
df.shape
Out[6]:
(569, 32)
In [7]:
# Get statistical overview of columns
df.describe()
Out[7]:
id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 31 columns

Breakdown of dataset columns

This dataset includes an id column, a diagnosis column, and three sets of data columns - the mean of each feature for each sample, the worst, or largest mean value, of each feature, and the standard error for each feature. These are denoted by the columns titled [feature]_mean, [feature[_worst, and [feature]_serespectively.

Among these, the diagnosis column will act as the class column for the classification task, while the id column will serve as the identification column, which will largely be unused in the couurse of this project.

The remaining 30 columns, on the other hand, are columns that reflect measurements of physical characteristics of the tumoours, and therefore are more likely to be indicative of whether the tumour is malignant or benign. This will be the pool of potential features to be used in the classification of the tumours.

In [8]:
# Feature standard error columns
se_features = [x for x in df.columns if '_se' in x]
se_df = df[se_features]
In [9]:
# Feature mean columns
mean_features = [x for x in df.columns if '_mean' in x]
mean_df = df[mean_features]
In [10]:
# Feature worst columns
worst_features = [x for x in df.columns if '_worst' in x]
worst_df = df[worst_features]

Data Visualisation

In [11]:
class_counts = df['diagnosis'].value_counts(ascending=False).values

sns.countplot(data=df,x='diagnosis')
Out[11]:
<AxesSubplot:xlabel='diagnosis', ylabel='count'>
In [12]:
# Helper function to create pairplot for features
def plotGrid(df, y, features):
    features.append(y)
    df_plot = df[features]
    sns.pairplot(data = df_plot, hue = y)
In [13]:
# Plot all mean features
plotGrid(df, 'diagnosis', mean_features)
In [14]:
# Plot all worst features
plotGrid(df, 'diagnosis', worst_features)
In [15]:
# Plot all feature standard errors
plotGrid(df, 'diagnosis', se_features)

From the pairplots, we can tell that not all the features available in the dataset are especially relevant in determining if a tumour is benign or malignant. Some of the graphs show that the particular feature has a similar distribution of results whether the tumour is benign or malignant, for example, the fractal_dimension_se feature.

In [16]:
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))

ax = ax.flatten()
for idx, col in enumerate(mean_df.columns):
    sns.histplot(data=mean_df, x=col,kde=True,ax=ax[idx])
    
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)
In [17]:
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))

ax = ax.flatten()
for idx, col in enumerate(worst_df.columns):
    sns.histplot(data=worst_df, x=col,kde=True,ax=ax[idx])
    
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)
In [18]:
fig, ax = plt.subplots(ncols=5, nrows=2, figsize=(16,8))

ax = ax.flatten()
for idx, col in enumerate(se_df.columns):
    sns.histplot(data=se_df, x=col,kde=True,ax=ax[idx])
    
plt.tight_layout(pad=0.5,h_pad=0.8,w_pad=0.5)

Preparing the data

In [19]:
# Label Encoding
encoder = LabelEncoder()
df['diagnosis'] = encoder.fit_transform(df['diagnosis'])
y = df[['diagnosis']]
y
Out[19]:
diagnosis
0 1
1 1
2 1
3 1
4 1
... ...
564 1
565 1
566 1
567 1
568 0

569 rows × 1 columns

First, we scale the model using the minmax scaler sklearn provides. This scaler was chosen as the output of the data will all be between 0 and 1, which works for our purposes as some of the machine learning models cannot accomodate negative input data.

In [20]:
# Get features and scale
scaler = MinMaxScaler()
x = df.drop(['id', 'diagnosis'], axis = 1)
scaler.fit(x)
x = scaler.transform(x)
x
Out[20]:
array([[0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
        0.41886396],
       [0.64314449, 0.27257355, 0.61578329, ..., 0.63917526, 0.23358959,
        0.22287813],
       [0.60149557, 0.3902604 , 0.59574321, ..., 0.83505155, 0.40370589,
        0.21343303],
       ...,
       [0.45525108, 0.62123774, 0.44578813, ..., 0.48728522, 0.12872068,
        0.1519087 ],
       [0.64456434, 0.66351031, 0.66553797, ..., 0.91065292, 0.49714173,
        0.45231536],
       [0.03686876, 0.50152181, 0.02853984, ..., 0.        , 0.25744136,
        0.10068215]])
In [21]:
# Split the dataset into training, testing, and validation sets, at a ratio of 2:7:1
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state = state)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=state)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
y_val = np.ravel(y_val)
In [22]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_val.shape, y_val.shape
Out[22]:
((398, 30), (398,), (114, 30), (114,), (57, 30), (57,))

Earlier, it was established that not all features may be helpful in solving this problem. Therefore, the more relevant featurers will have to be picked out for use. However, conjecture based on visual information is insufficient to disqualify features, and thus we will be using the 'SelectKBest' module available in scikit-learn, which calculates the chi-squared value for the features and uses it to select the k best features for use with machine learning models.

In [23]:
# Find 10 best scored features
n_features=10
select_feature = SelectKBest(chi2, k=n_features).fit(X_train, y_train)
X_train_selected = select_feature.transform(X_train)
X_val_selected = select_feature.transform(X_val)
X_test_selected = select_feature.transform(X_test)

X_train_selected.shape, X_val_selected.shape, X_test_selected.shape
Out[23]:
((398, 10), (57, 10), (114, 10))

We will use a pandas dataframe to store the final results.

In [24]:
index = 1
results_df =  pd.DataFrame(columns=['Index','Model Name','Training Set Recall','Training Set Accuracy','Testing Set Recall','Testing Set Accuracy'])
results_df
Out[24]:
Index Model Name Training Set Recall Training Set Accuracy Testing Set Recall Testing Set Accuracy
In [25]:
# Helper function to insert results into results dataframe
def insert_results(name, r1, r2, r3, r4):
    global index, results_df
    results_df = pd.concat([results_df, pd.DataFrame({'Index': [index], 'Model Name': [name],
                    'Training Set Recall':[r1],'Training Set Accuracy':[r2],'Testing Set Recall':[r3],'Testing Set Accuracy':[r4]})])
    index += 1

K-Nearest Neighbours

The first machine learning model we will be evaluating will be the K-Nearest Neighbours algorithm. The K-Nearest Neighbours Algorithm classifies entries by looking at each data point and searching for the k nearest data points to the data point, then decides which class they are by the majority classification among those k points. First, we will run the model on default parameters.

In [26]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train_selected,y_train)
knn_model.get_params()
Out[26]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
In [27]:
# Training set results
y_pred = knn_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[27]:
(0.9103448275862069, 0.957286432160804)
In [28]:
# Testing set results
y_pred = knn_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[28]:
(0.9130434782608695, 0.956140350877193)

Now, we will use grid search cross-validation to tune the hyper parameters we have. We do this by compiling a reasonable parameter grid - in this case, we will try different values for n_neighbours, the number of neighbours that the KNN algorithm will take into consideration, and the weights system, which determines how heavily different data points are weighted.

In [29]:
parameters = {"n_neighbors":np.linspace(1,10,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
Out[29]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 1,
 'p': 2,
 'weights': 'uniform'}

As we can see, the grid search has returned a model where the n_neighbours hyperparameter has changed. Now, we will test the new model by evaluating its performance against the training and testing set.

In [30]:
# Training set results
y_pred = knn_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[30]:
(1.0, 1.0)
In [31]:
# Testing set results
y_pred = knn_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[31]:
(0.9130434782608695, 0.9385964912280702)

The performance of the model on the training set has increased to 100%. This suggests that the model may be overfitted. However, the performance on the testing set has gotten worse, with the accuracy decreasing from 95.6% to 93.9%. In this case, we should repeat the tuning on the hyperparameters, but avoid the change made with this tuning, as it seems to have caused the model to become overfitted.

In [32]:
parameters = {"n_neighbors":np.linspace(2,11,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
Out[32]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 2,
 'p': 2,
 'weights': 'distance'}
In [33]:
# Testing set results
y_pred = knn_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[33]:
(1.0, 1.0)
In [34]:
# Testing set results
y_pred = knn_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[34]:
(0.9130434782608695, 0.9385964912280702)
In [35]:
parameters = {"n_neighbors":np.linspace(3,12,10).astype(int), "weights":["uniform","distance"]}
knn_optimised = GridSearchCV(knn_model, parameters, cv=5,scoring="recall")
knn_optimised.fit(X_train_selected, y_train)
knn_optimised.best_estimator_.get_params()
Out[35]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}
In [36]:
y_pred = knn_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[36]:
(0.9379310344827586, 0.964824120603015)
In [37]:
y_pred = knn_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[37]:
(0.9565217391304348, 0.9649122807017544)
In [38]:
insert_results("K-Nearest Neighbours", train_recall, train_accuracy, test_recall, test_accuracy)

After repeating the grid search twice, avoiding the values of 1 and 2 on the n_neighbours hyperparameter as they cause the model to be overfitted, we find that n_neighbours = 3 causes the model to perform much better than any of the previous configurations on the testing set. While it also causes the performance on the training set to worsen, that indicates that the model is not overfitted, and can generalise better.

Support Vector Classifier

The next machine learning model we will be evaluating will be a Support Vector Classifier. Support Vector Classifiers use hyperplanes to define decision boundaries, which they use to solve classification problems in high dimensiona spaces. First, we will run the model on default parameters.

In [39]:
svc_model = SVC(gamma='auto', class_weight={0: class_counts[0], 1: class_counts[1]}, random_state=state)
svc_model.fit(X_train_selected,y_train)
svc_model.get_params()
Out[39]:
{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': {0: 357, 1: 212},
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'auto',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': 73,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
In [40]:
y_pred = svc_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[40]:
(0.8758620689655172, 0.9547738693467337)
In [41]:
y_pred = svc_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[41]:
(0.8913043478260869, 0.956140350877193)
In [42]:
cm = confusion_matrix(y_test, y_pred)
cm
Out[42]:
array([[68,  0],
       [ 5, 41]], dtype=int64)

We can see that the support vector classified performed worse than the K Nearest Neighbours algorithm under default parameters.

Next, we will again use grid search cross-validation to tune the hyper parameters we have. In this case, we will try different values for C, the regularisation parameter, which controls the margin of error allowed to the support vector classifier in the construction of the hyperplane, coef0, the coefficient, and the kernel, the type of function used to construct the hyperplane.

In [43]:
parameters = {'C': np.logspace(-3, 3, 100),'kernel': ['linear', 'sigmoid'],
              'gamma':['scale', 'auto'],'coef0':np.linspace(0, 10, 10).astype(int)}
svc_optimised = RandomizedSearchCV(svc_model, parameters,scoring="recall",random_state=state)
svc_optimised.fit(X_train_selected, y_train)
svc_optimised.best_estimator_.get_params()
Out[43]:
{'C': 657.9332246575682,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': {0: 357, 1: 212},
 'coef0': 4,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'linear',
 'max_iter': -1,
 'probability': False,
 'random_state': 73,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}
In [44]:
y_pred = svc_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[44]:
(0.9448275862068966, 0.9748743718592965)
In [45]:
y_pred = svc_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[45]:
(0.9130434782608695, 0.956140350877193)
In [46]:
insert_results("Support Vector Classifier", train_recall, train_accuracy, test_recall, test_accuracy)

In this case, grid search gave a configuration for the model that sees a marked improvement in the performace of the model, bumping the up performance scores of the model on both the training and testing sets, though the accuracy of the model on the testing set remained constant.

Random Forest Classifier

Next, we will use the Random Forest Classifier to tackle this problem. The Random Forest Classifier is a machine learning algorithm that combines multiple decision trees to make predictions on input data provided to it. Firstly, we will test it on the problem with default parameters, only giving it the weights of the classes to go off of.

In [47]:
rf_model = RandomForestClassifier(
    random_state = state,
    class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
Out[47]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': {0: 357, 1: 212},
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 73,
 'verbose': 0,
 'warm_start': False}
In [48]:
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[48]:
(1.0, 1.0)
In [49]:
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[49]:
(0.8913043478260869, 0.9385964912280702)

As we can see, the Random Forest Classifier achieves a 100% recall on default parameters. This could be a sign that the model is overfitted, and we will need to move on to the standard hyperparameter tuning. However, unlike the previous models, we cannot just use GridSearchCV as the model is already performing at 100% for the training set, and thus there is no further room for improvement by GridSearchCV's standards.

With that in mind, the approach will be to perform manual hyperparameter tuning, focusing on trying to reduce overfitting. The first thing we can try to reduce overfitting will be to lower the number of estimators in the random forest from the default value of 100. We will first try to establish a lower bound for the value of n_estimators by dramatically reducing the value until the recall score of the random forest classifier on the training set drops below 100%, or the recall score on the test set drops below the initial value of 0.891.

In [50]:
rf_model = RandomForestClassifier(
    random_state = state,
    n_estimators = 20,
    class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
Out[50]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': {0: 357, 1: 212},
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 20,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 73,
 'verbose': 0,
 'warm_start': False}
In [51]:
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[51]:
(1.0, 1.0)
In [52]:
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[52]:
(0.8913043478260869, 0.9298245614035088)
In [53]:
rf_model = RandomForestClassifier(
    random_state = state,
    n_estimators = 5,
    class_weight={0: class_counts[0], 1: class_counts[1]})
rf_model.fit(X_train_selected,y_train)
rf_model.get_params()
Out[53]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': {0: 357, 1: 212},
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 5,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 73,
 'verbose': 0,
 'warm_start': False}
In [54]:
y_pred = rf_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[54]:
(0.9517241379310345, 0.9798994974874372)
In [55]:
y_pred = rf_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[55]:
(0.9130434782608695, 0.9385964912280702)

In this case, reducing the number of estimators to 5 reduces the recall score on the training set, but caused the recall score of the testing set to increase. This indicates that the model has gotten better at generalising its results, and is no longer as overfittedd as before. However, this does not mean that it is now optimised, though it does mean that we can now employ methods like GridSearchCV to try to optimise its parameters.

In [56]:
parameters = {'min_samples_leaf': [1,2,3,4], 'n_estimators': np.linspace(2,8,num=7).astype(int), 'max_depth': [10,20,None]}
rf_optimised = GridSearchCV(rf_model, parameters,scoring="recall", refit=True)
rf_optimised.fit(X_train_selected, y_train)
rf_optimised.best_estimator_.get_params()
Out[56]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': {0: 357, 1: 212},
 'criterion': 'gini',
 'max_depth': 10,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 3,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 73,
 'verbose': 0,
 'warm_start': False}
In [57]:
y_pred = rf_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[57]:
(0.9586206896551724, 0.9798994974874372)
In [58]:
y_pred = rf_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[58]:
(0.9565217391304348, 0.9473684210526315)
In [59]:
insert_results("Random Forest Classifier", train_recall, train_accuracy, test_recall, test_accuracy)

In this case, the grid search cross validation similarly boosted scores across the board, giving us a significant increase in both the recall and accuracy scores on the testing set.

Naive Bayes Classifier

The next machine learning model that will be explored is the Naive Bayes classifier. The Naive Bayes classifier is a machine learning algorithm that uses Bayes' theorem to classify a data point, that assumes that features are independent.

In [60]:
# Build a Gaussian Classifier
nb_model = GaussianNB()

# Model training
nb_model.fit(X_train_selected,y_train)

nb_model.get_params()
Out[60]:
{'priors': None, 'var_smoothing': 1e-09}
In [61]:
y_pred = nb_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[61]:
(0.9172413793103448, 0.9422110552763819)
In [62]:
y_pred = nb_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[62]:
(0.9565217391304348, 0.956140350877193)

Now for hyperparameter testing. The Naive Bayes classifier does not have many hyperparameters that require tuning, only var_smoothing, which is the variable that adds small value to the variances of the features.

In [63]:
parameters = {'var_smoothing': np.logspace(0,-10, num=100)}
nb_optimised = GridSearchCV(nb_model, parameters,scoring="recall")
nb_optimised.fit(X_train_selected, y_train)
nb_optimised.best_estimator_.get_params()
Out[63]:
{'priors': None, 'var_smoothing': 0.01519911082952934}
In [64]:
y_pred = nb_optimised.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[64]:
(0.903448275862069, 0.9371859296482412)
In [65]:
y_pred = nb_optimised.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[65]:
(0.9347826086956522, 0.9473684210526315)

As we can see, a larger value of var_smoothing seems to make the model perform worse. With that in mind, we can retry the tuning using a range of numbers that center on the default value.

In [66]:
parameters = {'var_smoothing': np.logspace(-4,-14, num=100)}
nb_optimised = GridSearchCV(nb_model, parameters,scoring="recall")
nb_optimised.fit(X_train_selected, y_train)
nb_optimised.best_estimator_.get_params()
Out[66]:
{'priors': None, 'var_smoothing': 0.0001}
In [67]:
y_pred = nb_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[67]:
(0.9172413793103448, 0.9422110552763819)
In [68]:
y_pred = nb_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[68]:
(0.9565217391304348, 0.956140350877193)
In [69]:
insert_results("Naive Bayes Classifier", train_recall, train_accuracy, test_recall, test_accuracy)

In this case, it would seem that the default value for the hyperparameter performed the best.

Logistic Regression

The final machine learning algorithm we will be looking at is logistic regression. Logistic regression is a statistical model that estimates the probability of a data point belonging to each class by fitting a logistic function to the data.

In [70]:
lr_model = LogisticRegression(class_weight={0: class_counts[0], 1: class_counts[1]}, random_state=state, max_iter=5000)
lr_model.fit(X_train_selected,y_train)
lr_model.get_params()
Out[70]:
{'C': 1.0,
 'class_weight': {0: 357, 1: 212},
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 5000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 73,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
In [71]:
y_pred = lr_model.predict(X_train_selected)
recall = recall_score(y_train, y_pred)
accuracy = accuracy_score(y_train, y_pred)
(recall, accuracy)
Out[71]:
(0.9310344827586207, 0.9723618090452262)
In [72]:
y_pred = lr_model.predict(X_test_selected)
recall = recall_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
Out[72]:
(0.9130434782608695, 0.956140350877193)

In the case of logistic regression, the hyperparameters that need to be looked at have a larger range of values than the previous models. This means that grid search cross validation will take a bit more time to run its course than for the others. We will need to look at a value of C, the regularisation parameter, penalty, the type of regularisation, and solver, different optimisation algorithms for fitting the model.

In [73]:
parameters = {'C': np.logspace(-2, 2, 99)
              ,'penalty': ['l1', 'l2', 'elasticnet']
              ,'solver':['lbfgs'
                         , 'liblinear'
                         , 'newton-cg'
                         , 'newton-cholesky'
                        ]
             }
lr_optimised = GridSearchCV(lr_model, parameters,scoring="recall")
lr_optimised.fit(X_train_selected, y_train)
lr_optimised.best_estimator_.get_params()
F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:372: FitFailedWarning: 
3960 fits failed out of a total of 5940.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

--------------------------------------------------------------------------------
1485 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 434, in _check_solver
    raise ValueError(
ValueError: Logistic Regression supports only solvers in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'], got newton-cholesky.

--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got elasticnet penalty.

--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 457, in _check_solver
    raise ValueError(
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.

--------------------------------------------------------------------------------
495 fits failed with the following error:
Traceback (most recent call last):
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got elasticnet penalty.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
F:\Users\Sonata\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning: One or more of the test scores are non-finite: [       nan 0.86206897        nan ...        nan        nan        nan]
  warnings.warn(
Out[73]:
{'C': 0.05963623316594643,
 'class_weight': {0: 357, 1: 212},
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 5000,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': 73,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In this case, we are getting many warnings. This is due to the fact that some hyperparameters are inherently incompatible with others - for example, the 'newton-cg' solver with the 'elasticnet' penalty. The algorithm ignores the incompatible permutations of hyperparameters.

In [74]:
y_pred = lr_optimised.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[74]:
(0.9310344827586207, 0.9698492462311558)
In [75]:
y_pred = lr_optimised.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[75]:
(0.9130434782608695, 0.956140350877193)
In [76]:
y_pred = lr_model.predict(X_train_selected)
train_recall = recall_score(y_train, y_pred)
train_accuracy = accuracy_score(y_train, y_pred)
(train_recall, train_accuracy)
Out[76]:
(0.9310344827586207, 0.9723618090452262)
In [77]:
y_pred = lr_model.predict(X_test_selected)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
Out[77]:
(0.9130434782608695, 0.956140350877193)
In [78]:
insert_results("Logistic Regression", train_recall, train_accuracy, test_recall, test_accuracy)

In this case, it seems that the grid search cross-validation returned a value worse than that of the default parameters, therefore we will take the results obtained from the default.

Single Layer Perceptron

Next, we will be experimenting with a single layer perceptron. A single layer perceptron is the simplest form of a neural network, comprising only one input layer and one output layer, with no hidden layers. We will be building this single layer perceptron with 256 neurons in the input layer.

In [79]:
model_0 = Sequential([
    Dense(128,activation='relu',input_shape=(n_features,)),
    Dense(1,activation='sigmoid')])
model_0.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               1408      
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
=================================================================
Total params: 1,537
Trainable params: 1,537
Non-trainable params: 0
_________________________________________________________________
In [80]:
model_0.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Accuracy', 'Recall'])
In [81]:
n_epoch = 20
history_0 = model_0.fit(x=X_train_selected,y=y_train,
                    validation_data=(X_val_selected, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 34ms/step - loss: 0.6883 - Accuracy: 0.7638 - recall: 0.3862 - val_loss: 0.6727 - val_Accuracy: 0.8947 - val_recall: 0.9524
Epoch 2/20
13/13 [==============================] - 0s 4ms/step - loss: 0.6583 - Accuracy: 0.8643 - recall: 0.9724 - val_loss: 0.6433 - val_Accuracy: 0.8596 - val_recall: 0.9524
Epoch 3/20
13/13 [==============================] - 0s 4ms/step - loss: 0.6292 - Accuracy: 0.8618 - recall: 0.9793 - val_loss: 0.6137 - val_Accuracy: 0.8947 - val_recall: 0.9524
Epoch 4/20
13/13 [==============================] - 0s 4ms/step - loss: 0.5983 - Accuracy: 0.8894 - recall: 0.9586 - val_loss: 0.5812 - val_Accuracy: 0.9474 - val_recall: 0.9524
Epoch 5/20
13/13 [==============================] - 0s 4ms/step - loss: 0.5653 - Accuracy: 0.9347 - recall: 0.9103 - val_loss: 0.5454 - val_Accuracy: 0.9298 - val_recall: 0.9048
Epoch 6/20
13/13 [==============================] - 0s 4ms/step - loss: 0.5287 - Accuracy: 0.9372 - recall: 0.9172 - val_loss: 0.5073 - val_Accuracy: 0.9474 - val_recall: 0.9524
Epoch 7/20
13/13 [==============================] - 0s 4ms/step - loss: 0.4901 - Accuracy: 0.9347 - recall: 0.8828 - val_loss: 0.4659 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 8/20
13/13 [==============================] - 0s 4ms/step - loss: 0.4503 - Accuracy: 0.9322 - recall: 0.8690 - val_loss: 0.4259 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 9/20
13/13 [==============================] - 0s 4ms/step - loss: 0.4128 - Accuracy: 0.9296 - recall: 0.8552 - val_loss: 0.3884 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 10/20
13/13 [==============================] - 0s 4ms/step - loss: 0.3784 - Accuracy: 0.9372 - recall: 0.8828 - val_loss: 0.3549 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 11/20
13/13 [==============================] - 0s 8ms/step - loss: 0.3471 - Accuracy: 0.9397 - recall: 0.8897 - val_loss: 0.3248 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 12/20
13/13 [==============================] - 0s 4ms/step - loss: 0.3198 - Accuracy: 0.9322 - recall: 0.8621 - val_loss: 0.2967 - val_Accuracy: 0.9649 - val_recall: 0.9048
Epoch 13/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2963 - Accuracy: 0.9322 - recall: 0.8621 - val_loss: 0.2744 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 14/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2758 - Accuracy: 0.9347 - recall: 0.8690 - val_loss: 0.2563 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 15/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2587 - Accuracy: 0.9347 - recall: 0.8621 - val_loss: 0.2376 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 16/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2444 - Accuracy: 0.9347 - recall: 0.8621 - val_loss: 0.2237 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 17/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2317 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.2123 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 18/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2227 - Accuracy: 0.9397 - recall: 0.8828 - val_loss: 0.2038 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 19/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2131 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.1936 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 20/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2060 - Accuracy: 0.9372 - recall: 0.8690 - val_loss: 0.1872 - val_Accuracy: 0.9474 - val_recall: 0.9048

After training the model, we will plot out the training data, and look to see if there

In [82]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history_0.history["val_Accuracy"], label="val_acc")

plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[82]:
<matplotlib.legend.Legend at 0x13846b0fa30>

From the accuracy graph, we can see that the training model has been fitted to the data quite well, as the training set accuracy and validation set accuracy are quite close to each other.

In [83]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history_0.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[83]:
<matplotlib.legend.Legend at 0x13846ffc070>
In [84]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history_0.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history_0.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[84]:
<matplotlib.legend.Legend at 0x138482cbbe0>
In [85]:
y_pred = model_0.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
Out[85]:
(0.9347826086956522, 0.9473684210526315)
In [86]:
insert_results("Single Layer Perceptron", history_0.history["recall"][-1], history_0.history["Accuracy"][-1], test_recall, test_accuracy)

From this, we can see that the single layer perceptron exhibited a worse performance than some of the machine learning algorithms. It is likely that a multilayer perceptron, being more complex and containing regularisation techniques like dropout layers, will have a better performance.

Multilayer Perceptron

Following that, we will be looking at the multilayer perceptron, which is a type of feedforward neural network comprising multiple layers. The one we will be building will be a simple one one input layer, one output layer, and one hidden layer. It will also contain dropout layers, which help the model to prevent overfitting.

In [87]:
model_1 = Sequential([
    Dense(128,activation='relu',input_shape=(n_features,)),
    Dropout(0.2),
    Dense(64,activation='relu'),
    Dropout(0.2),
    Dense(1,activation='sigmoid')])
model_1.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_2 (Dense)             (None, 128)               1408      
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_3 (Dense)             (None, 64)                8256      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 9,729
Trainable params: 9,729
Non-trainable params: 0
_________________________________________________________________
In [88]:
model_1.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Accuracy', 'Recall'])
In [89]:
n_epoch = 20
history = model_1.fit(x=X_train_selected,y=y_train,
                    validation_data=(X_val_selected, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 44ms/step - loss: 0.6775 - Accuracy: 0.5779 - recall: 0.8345 - val_loss: 0.6532 - val_Accuracy: 0.4561 - val_recall: 1.0000
Epoch 2/20
13/13 [==============================] - 0s 7ms/step - loss: 0.6377 - Accuracy: 0.7085 - recall: 0.9655 - val_loss: 0.6042 - val_Accuracy: 0.7895 - val_recall: 0.9524
Epoch 3/20
13/13 [==============================] - 0s 4ms/step - loss: 0.5772 - Accuracy: 0.8693 - recall: 0.9448 - val_loss: 0.5307 - val_Accuracy: 0.9298 - val_recall: 0.9524
Epoch 4/20
13/13 [==============================] - 0s 5ms/step - loss: 0.5034 - Accuracy: 0.9221 - recall: 0.8690 - val_loss: 0.4389 - val_Accuracy: 0.9123 - val_recall: 0.8571
Epoch 5/20
13/13 [==============================] - 0s 4ms/step - loss: 0.4043 - Accuracy: 0.9372 - recall: 0.9034 - val_loss: 0.3386 - val_Accuracy: 0.9298 - val_recall: 0.9048
Epoch 6/20
13/13 [==============================] - 0s 4ms/step - loss: 0.3214 - Accuracy: 0.9246 - recall: 0.8759 - val_loss: 0.2573 - val_Accuracy: 0.9474 - val_recall: 0.8571
Epoch 7/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2521 - Accuracy: 0.9246 - recall: 0.8345 - val_loss: 0.2057 - val_Accuracy: 0.9298 - val_recall: 0.9048
Epoch 8/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2096 - Accuracy: 0.9422 - recall: 0.8759 - val_loss: 0.1733 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 9/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1885 - Accuracy: 0.9397 - recall: 0.8897 - val_loss: 0.1563 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 10/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1798 - Accuracy: 0.9296 - recall: 0.8621 - val_loss: 0.1462 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 11/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1718 - Accuracy: 0.9397 - recall: 0.9241 - val_loss: 0.1421 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 12/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1553 - Accuracy: 0.9422 - recall: 0.8828 - val_loss: 0.1383 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 13/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1503 - Accuracy: 0.9397 - recall: 0.8966 - val_loss: 0.1367 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 14/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1494 - Accuracy: 0.9447 - recall: 0.9034 - val_loss: 0.1367 - val_Accuracy: 0.9649 - val_recall: 0.9524
Epoch 15/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1461 - Accuracy: 0.9472 - recall: 0.9241 - val_loss: 0.1350 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 16/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1511 - Accuracy: 0.9372 - recall: 0.8759 - val_loss: 0.1354 - val_Accuracy: 0.9649 - val_recall: 0.9524
Epoch 17/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1386 - Accuracy: 0.9497 - recall: 0.8966 - val_loss: 0.1353 - val_Accuracy: 0.9649 - val_recall: 0.9524
Epoch 18/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1505 - Accuracy: 0.9322 - recall: 0.9379 - val_loss: 0.1351 - val_Accuracy: 0.9474 - val_recall: 0.9048
Epoch 19/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1360 - Accuracy: 0.9422 - recall: 0.8897 - val_loss: 0.1353 - val_Accuracy: 0.9649 - val_recall: 0.9524
Epoch 20/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1456 - Accuracy: 0.9497 - recall: 0.9172 - val_loss: 0.1369 - val_Accuracy: 0.9474 - val_recall: 0.9524
In [90]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[90]:
<matplotlib.legend.Legend at 0x1384ba65370>
In [91]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[91]:
<matplotlib.legend.Legend at 0x138493888b0>
In [92]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[92]:
<matplotlib.legend.Legend at 0x1384ac96e50>
In [93]:
y_pred = model_1.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
Out[93]:
(0.9565217391304348, 0.956140350877193)
In [94]:
insert_results("Multilayer Perceptron", history_0.history["recall"][-1], history_0.history["Accuracy"][-1], test_recall, test_accuracy)

Convolutional Neural Network

Finally, we will be working with a convolutional neural network. A convolutional neural network (CNN), is a type of neural network that has a convolutional layer for its input layer, and specialises in processing data structured in grid form, like image data.

Define the Model

As a convolutional neural network requires 3-dimensional data as input, we will reshape our input data to add an extra dimension.

In [95]:
X_train_3d = X_train_selected.reshape(X_train_selected.shape[0], X_train_selected.shape[1], 1)
X_test_3d = X_test_selected.reshape(X_test_selected.shape[0], X_test_selected.shape[1], 1)
X_val_3d = X_val_selected.reshape(X_val_selected.shape[0], X_val_selected.shape[1], 1)
X_train_3d.shape, X_test_3d.shape, X_val_3d.shape
Out[95]:
((398, 10, 1), (114, 10, 1), (57, 10, 1))
In [96]:
cnn_model = Sequential([
    Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
    Dropout(0.2),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1,activation='sigmoid')
])
cnn_model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d (Conv1D)             (None, 9, 32)             96        
                                                                 
 dropout_2 (Dropout)         (None, 9, 32)             0         
                                                                 
 flatten (Flatten)           (None, 288)               0         
                                                                 
 dense_5 (Dense)             (None, 64)                18496     
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
In [97]:
cnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
In [98]:
# Fit the model
history = cnn_model.fit(X_train_3d, y_train,
                    validation_data=(X_val_3d, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 26ms/step - loss: 0.6691 - recall: 0.9655 - Accuracy: 0.4598 - val_loss: 0.6401 - val_recall: 1.0000 - val_Accuracy: 0.4561
Epoch 2/20
13/13 [==============================] - 0s 5ms/step - loss: 0.6215 - recall: 0.9862 - Accuracy: 0.6683 - val_loss: 0.5831 - val_recall: 0.9524 - val_Accuracy: 0.7895
Epoch 3/20
13/13 [==============================] - 0s 5ms/step - loss: 0.5577 - recall: 0.9586 - Accuracy: 0.8643 - val_loss: 0.4962 - val_recall: 0.9524 - val_Accuracy: 0.9123
Epoch 4/20
13/13 [==============================] - 0s 4ms/step - loss: 0.4684 - recall: 0.8690 - Accuracy: 0.9196 - val_loss: 0.3857 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 5/20
13/13 [==============================] - 0s 5ms/step - loss: 0.3668 - recall: 0.8690 - Accuracy: 0.9246 - val_loss: 0.2825 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 6/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2752 - recall: 0.8690 - Accuracy: 0.9296 - val_loss: 0.2125 - val_recall: 0.9048 - val_Accuracy: 0.9649
Epoch 7/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2327 - recall: 0.8345 - Accuracy: 0.9196 - val_loss: 0.1768 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 8/20
13/13 [==============================] - 0s 6ms/step - loss: 0.1946 - recall: 0.8897 - Accuracy: 0.9397 - val_loss: 0.1579 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 9/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1846 - recall: 0.9034 - Accuracy: 0.9372 - val_loss: 0.1508 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 10/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1751 - recall: 0.8759 - Accuracy: 0.9347 - val_loss: 0.1463 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 11/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1763 - recall: 0.9103 - Accuracy: 0.9347 - val_loss: 0.1443 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 12/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1625 - recall: 0.8621 - Accuracy: 0.9322 - val_loss: 0.1423 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 13/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1653 - recall: 0.8690 - Accuracy: 0.9347 - val_loss: 0.1421 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 14/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1475 - recall: 0.9034 - Accuracy: 0.9447 - val_loss: 0.1415 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 15/20
13/13 [==============================] - 0s 6ms/step - loss: 0.1580 - recall: 0.8897 - Accuracy: 0.9372 - val_loss: 0.1413 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 16/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1630 - recall: 0.8897 - Accuracy: 0.9397 - val_loss: 0.1412 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 17/20
13/13 [==============================] - 0s 7ms/step - loss: 0.1590 - recall: 0.8897 - Accuracy: 0.9322 - val_loss: 0.1427 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 18/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1560 - recall: 0.9379 - Accuracy: 0.9347 - val_loss: 0.1416 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 19/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1617 - recall: 0.8690 - Accuracy: 0.9246 - val_loss: 0.1417 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 20/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1640 - recall: 0.8759 - Accuracy: 0.9322 - val_loss: 0.1417 - val_recall: 0.9048 - val_Accuracy: 0.9298
In [99]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[99]:
<matplotlib.legend.Legend at 0x13846798190>
In [100]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[100]:
<matplotlib.legend.Legend at 0x13846782760>
In [101]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[101]:
<matplotlib.legend.Legend at 0x13846f7c250>
In [102]:
y_pred = cnn_model.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(recall, accuracy)
4/4 [==============================] - 0s 1ms/step
Out[102]:
(0.9130434782608695, 0.956140350877193)
In [103]:
cnn_model1 = Sequential([
    Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
    Dropout(0.5),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(1,activation='sigmoid')
])
cnn_model1.summary()
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d_1 (Conv1D)           (None, 9, 32)             96        
                                                                 
 dropout_4 (Dropout)         (None, 9, 32)             0         
                                                                 
 flatten_1 (Flatten)         (None, 288)               0         
                                                                 
 dense_7 (Dense)             (None, 64)                18496     
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
In [104]:
cnn_model1.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
In [105]:
history = cnn_model1.fit(X_train_3d, y_train,
                    validation_data=(X_val_3d, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 29ms/step - loss: 0.6868 - recall: 0.4966 - Accuracy: 0.6332 - val_loss: 0.6616 - val_recall: 1.0000 - val_Accuracy: 0.5789
Epoch 2/20
13/13 [==============================] - 0s 5ms/step - loss: 0.6513 - recall: 0.7862 - Accuracy: 0.7312 - val_loss: 0.6159 - val_recall: 0.9524 - val_Accuracy: 0.8070
Epoch 3/20
13/13 [==============================] - 0s 5ms/step - loss: 0.5989 - recall: 0.8276 - Accuracy: 0.8216 - val_loss: 0.5330 - val_recall: 0.9524 - val_Accuracy: 0.9298
Epoch 4/20
13/13 [==============================] - 0s 5ms/step - loss: 0.5107 - recall: 0.7379 - Accuracy: 0.8844 - val_loss: 0.4164 - val_recall: 0.8095 - val_Accuracy: 0.9298
Epoch 5/20
13/13 [==============================] - 0s 5ms/step - loss: 0.4076 - recall: 0.7379 - Accuracy: 0.8920 - val_loss: 0.3036 - val_recall: 0.9048 - val_Accuracy: 0.9649
Epoch 6/20
13/13 [==============================] - 0s 6ms/step - loss: 0.3030 - recall: 0.8483 - Accuracy: 0.9196 - val_loss: 0.2276 - val_recall: 0.9048 - val_Accuracy: 0.9649
Epoch 7/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2442 - recall: 0.8207 - Accuracy: 0.9196 - val_loss: 0.1867 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 8/20
13/13 [==============================] - 0s 4ms/step - loss: 0.2289 - recall: 0.8207 - Accuracy: 0.9045 - val_loss: 0.1643 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 9/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1892 - recall: 0.8759 - Accuracy: 0.9422 - val_loss: 0.1586 - val_recall: 0.9048 - val_Accuracy: 0.9123
Epoch 10/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1915 - recall: 0.8828 - Accuracy: 0.9271 - val_loss: 0.1478 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 11/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1918 - recall: 0.8828 - Accuracy: 0.9271 - val_loss: 0.1456 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 12/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1735 - recall: 0.9034 - Accuracy: 0.9397 - val_loss: 0.1435 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 13/20
13/13 [==============================] - 0s 7ms/step - loss: 0.1907 - recall: 0.8552 - Accuracy: 0.9171 - val_loss: 0.1428 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 14/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1741 - recall: 0.8897 - Accuracy: 0.9322 - val_loss: 0.1441 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 15/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1723 - recall: 0.8897 - Accuracy: 0.9296 - val_loss: 0.1429 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 16/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1711 - recall: 0.8759 - Accuracy: 0.9372 - val_loss: 0.1427 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 17/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1759 - recall: 0.8759 - Accuracy: 0.9246 - val_loss: 0.1431 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 18/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1892 - recall: 0.9103 - Accuracy: 0.9372 - val_loss: 0.1440 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 19/20
13/13 [==============================] - 0s 6ms/step - loss: 0.1743 - recall: 0.8759 - Accuracy: 0.9246 - val_loss: 0.1424 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 20/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1627 - recall: 0.9034 - Accuracy: 0.9422 - val_loss: 0.1425 - val_recall: 0.9048 - val_Accuracy: 0.9474
In [106]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[106]:
<matplotlib.legend.Legend at 0x13846c4ee20>
In [107]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[107]:
<matplotlib.legend.Legend at 0x1384697e1c0>
In [108]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[108]:
<matplotlib.legend.Legend at 0x13846371d90>
In [109]:
y_pred = cnn_model1.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 1ms/step
Out[109]:
(0.9347826086956522, 0.956140350877193)
In [143]:
cnn_model2 = Sequential([
    Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
    Dropout(0.2),    
    Flatten(),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dropout(0.2),
    Dense(1,activation='sigmoid')
])
cnn_model2.summary()
Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d_7 (Conv1D)           (None, 9, 32)             96        
                                                                 
 dropout_18 (Dropout)        (None, 9, 32)             0         
                                                                 
 flatten_7 (Flatten)         (None, 288)               0         
                                                                 
 dense_21 (Dense)            (None, 32)                9248      
                                                                 
 dropout_19 (Dropout)        (None, 32)                0         
                                                                 
 dense_22 (Dense)            (None, 16)                528       
                                                                 
 dropout_20 (Dropout)        (None, 16)                0         
                                                                 
 dense_23 (Dense)            (None, 1)                 17        
                                                                 
=================================================================
Total params: 9,889
Trainable params: 9,889
Non-trainable params: 0
_________________________________________________________________
In [144]:
cnn_model2.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
In [145]:
history = cnn_model2.fit(X_train_3d, y_train,
                    validation_data=(X_val_3d, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 26ms/step - loss: 0.5871 - recall: 0.6414 - Accuracy: 0.8216 - val_loss: 0.2930 - val_recall: 0.9048 - val_Accuracy: 0.9123
Epoch 2/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2461 - recall: 0.8276 - Accuracy: 0.9020 - val_loss: 0.1629 - val_recall: 0.9524 - val_Accuracy: 0.9298
Epoch 3/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2088 - recall: 0.8966 - Accuracy: 0.9271 - val_loss: 0.1510 - val_recall: 0.8571 - val_Accuracy: 0.9474
Epoch 4/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2039 - recall: 0.9103 - Accuracy: 0.9146 - val_loss: 0.2507 - val_recall: 0.8095 - val_Accuracy: 0.9298
Epoch 5/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2832 - recall: 0.8345 - Accuracy: 0.8794 - val_loss: 0.1660 - val_recall: 0.9048 - val_Accuracy: 0.9123
Epoch 6/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1890 - recall: 0.8276 - Accuracy: 0.9171 - val_loss: 0.1447 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 7/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1684 - recall: 0.8897 - Accuracy: 0.9271 - val_loss: 0.1613 - val_recall: 0.9048 - val_Accuracy: 0.9123
Epoch 8/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1609 - recall: 0.8828 - Accuracy: 0.9422 - val_loss: 0.1818 - val_recall: 0.9524 - val_Accuracy: 0.9298
Epoch 9/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1749 - recall: 0.8897 - Accuracy: 0.9296 - val_loss: 0.1443 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 10/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1581 - recall: 0.9172 - Accuracy: 0.9397 - val_loss: 0.1479 - val_recall: 0.8571 - val_Accuracy: 0.9298
Epoch 11/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1574 - recall: 0.9172 - Accuracy: 0.9422 - val_loss: 0.1667 - val_recall: 0.8571 - val_Accuracy: 0.9474
Epoch 12/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1623 - recall: 0.9241 - Accuracy: 0.9497 - val_loss: 0.1447 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 13/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1351 - recall: 0.9034 - Accuracy: 0.9523 - val_loss: 0.1483 - val_recall: 0.9524 - val_Accuracy: 0.9474
Epoch 14/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1483 - recall: 0.8966 - Accuracy: 0.9296 - val_loss: 0.1453 - val_recall: 0.9524 - val_Accuracy: 0.9474
Epoch 15/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1289 - recall: 0.9241 - Accuracy: 0.9523 - val_loss: 0.1404 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 16/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1517 - recall: 0.8897 - Accuracy: 0.9372 - val_loss: 0.1435 - val_recall: 0.9524 - val_Accuracy: 0.9474
Epoch 17/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1231 - recall: 0.9241 - Accuracy: 0.9598 - val_loss: 0.1654 - val_recall: 0.9524 - val_Accuracy: 0.9298
Epoch 18/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1330 - recall: 0.9103 - Accuracy: 0.9472 - val_loss: 0.1405 - val_recall: 0.9524 - val_Accuracy: 0.9474
Epoch 19/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1344 - recall: 0.9379 - Accuracy: 0.9623 - val_loss: 0.1590 - val_recall: 0.8571 - val_Accuracy: 0.9474
Epoch 20/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1435 - recall: 0.9172 - Accuracy: 0.9497 - val_loss: 0.1407 - val_recall: 0.8571 - val_Accuracy: 0.9474
In [146]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[146]:
<matplotlib.legend.Legend at 0x1384632b8e0>
In [147]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[147]:
<matplotlib.legend.Legend at 0x13857238df0>
In [148]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[148]:
<matplotlib.legend.Legend at 0x1385837ed30>
In [149]:
y_pred = cnn_model2.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
Out[149]:
(0.9130434782608695, 0.9649122807017544)
In [131]:
cnn_model = Sequential([
    Conv1D(filters=32,kernel_size=2,activation='relu',input_shape=(n_features,1)),
    Dropout(0.2),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1,activation='sigmoid')
])
cnn_model.summary()
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d_5 (Conv1D)           (None, 9, 32)             96        
                                                                 
 dropout_13 (Dropout)        (None, 9, 32)             0         
                                                                 
 flatten_5 (Flatten)         (None, 288)               0         
                                                                 
 dense_16 (Dense)            (None, 64)                18496     
                                                                 
 dropout_14 (Dropout)        (None, 64)                0         
                                                                 
 dense_17 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
In [132]:
cnn_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['Recall','Accuracy'])
In [133]:
# Fit the model
history = cnn_model.fit(X_train_3d, y_train,
                    validation_data=(X_val_3d, y_val),
                    epochs=n_epoch)
Epoch 1/20
13/13 [==============================] - 2s 27ms/step - loss: 0.6751 - recall: 0.8828 - Accuracy: 0.5176 - val_loss: 0.6533 - val_recall: 1.0000 - val_Accuracy: 0.4737
Epoch 2/20
13/13 [==============================] - 0s 5ms/step - loss: 0.6328 - recall: 0.9931 - Accuracy: 0.6834 - val_loss: 0.5941 - val_recall: 0.9524 - val_Accuracy: 0.8070
Epoch 3/20
13/13 [==============================] - 0s 5ms/step - loss: 0.5655 - recall: 0.9586 - Accuracy: 0.8518 - val_loss: 0.4981 - val_recall: 0.9524 - val_Accuracy: 0.9123
Epoch 4/20
13/13 [==============================] - 0s 5ms/step - loss: 0.4606 - recall: 0.8966 - Accuracy: 0.9322 - val_loss: 0.3827 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 5/20
13/13 [==============================] - 0s 5ms/step - loss: 0.3627 - recall: 0.8276 - Accuracy: 0.9146 - val_loss: 0.2745 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 6/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2587 - recall: 0.8759 - Accuracy: 0.9397 - val_loss: 0.2032 - val_recall: 0.9048 - val_Accuracy: 0.9649
Epoch 7/20
13/13 [==============================] - 0s 5ms/step - loss: 0.2173 - recall: 0.8345 - Accuracy: 0.9221 - val_loss: 0.1687 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 8/20
13/13 [==============================] - 0s 7ms/step - loss: 0.1843 - recall: 0.8621 - Accuracy: 0.9372 - val_loss: 0.1519 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 9/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1816 - recall: 0.8828 - Accuracy: 0.9322 - val_loss: 0.1488 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 10/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1707 - recall: 0.8690 - Accuracy: 0.9271 - val_loss: 0.1438 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 11/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1722 - recall: 0.9172 - Accuracy: 0.9447 - val_loss: 0.1418 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 12/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1689 - recall: 0.8621 - Accuracy: 0.9322 - val_loss: 0.1408 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 13/20
13/13 [==============================] - 0s 6ms/step - loss: 0.1667 - recall: 0.8621 - Accuracy: 0.9296 - val_loss: 0.1412 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 14/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1623 - recall: 0.9034 - Accuracy: 0.9422 - val_loss: 0.1413 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 15/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1534 - recall: 0.8759 - Accuracy: 0.9372 - val_loss: 0.1415 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 16/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1596 - recall: 0.8897 - Accuracy: 0.9347 - val_loss: 0.1406 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 17/20
13/13 [==============================] - 0s 4ms/step - loss: 0.1553 - recall: 0.9103 - Accuracy: 0.9497 - val_loss: 0.1424 - val_recall: 0.9048 - val_Accuracy: 0.9298
Epoch 18/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1565 - recall: 0.9310 - Accuracy: 0.9397 - val_loss: 0.1405 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 19/20
13/13 [==============================] - 0s 5ms/step - loss: 0.1518 - recall: 0.8828 - Accuracy: 0.9447 - val_loss: 0.1404 - val_recall: 0.9048 - val_Accuracy: 0.9474
Epoch 20/20
13/13 [==============================] - 0s 8ms/step - loss: 0.1491 - recall: 0.9103 - Accuracy: 0.9447 - val_loss: 0.1420 - val_recall: 0.9048 - val_Accuracy: 0.9298
In [134]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["Accuracy"], label="train_acc")
plt.plot(np.arange(0, n_epoch), history.history["val_Accuracy"], label="val_acc")
plt.title("Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.legend()
Out[134]:
<matplotlib.legend.Legend at 0x13855a3a0d0>
In [135]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["loss"], label="train_loss")
plt.plot(np.arange(0, n_epoch), history.history["val_loss"], label="val_loss")
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend()
Out[135]:
<matplotlib.legend.Legend at 0x13855a372b0>
In [136]:
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, n_epoch), history.history["recall"], label="train_recall")
plt.plot(np.arange(0, n_epoch), history.history["val_recall"], label="val_recall")
plt.title("Training Recall")
plt.xlabel("Epoch #")
plt.ylabel("Recall")
plt.legend()
Out[136]:
<matplotlib.legend.Legend at 0x13855aedb20>
In [137]:
y_pred = cnn_model.predict(X_test_selected)
y_pred = [1 if y > 0.5 else 0 for y in y_pred]
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
Out[137]:
(0.9565217391304348, 0.956140350877193)
In [138]:
insert_results("Convolutional Neural Network", history.history["recall"][-1], history.history["Accuracy"][-1], test_recall, test_accuracy)

Visualise the Results

In [139]:
results_df
Out[139]:
Index Model Name Training Set Recall Training Set Accuracy Testing Set Recall Testing Set Accuracy
0 1 K-Nearest Neighbours 0.937931 0.964824 0.956522 0.964912
0 2 Support Vector Classifier 0.944828 0.974874 0.913043 0.95614
0 3 Random Forest Classifier 0.958621 0.979899 0.956522 0.947368
0 4 Naive Bayes Classifier 0.917241 0.942211 0.956522 0.95614
0 5 Logistic Regression 0.931034 0.972362 0.913043 0.95614
0 6 Single Layer Perceptron 0.868966 0.937186 0.934783 0.947368
0 7 Multilayer Perceptron 0.868966 0.937186 0.956522 0.95614
0 8 Convolutional Neural Network 0.910345 0.944724 0.956522 0.95614

Final Model

For the final model, we aim to achieve the highest recall and accuracy when making predictions on unseen data. To do this, we will combine several of the best performing models that we will look at, and derive final predictions by averaging the results. The machine learning models that will be used in the final ensemble model will be: the KNN model, the Naive Bayes classifier, and the multilayer perceptron. These models will be weighted according to their performance in the previous section.

In [155]:
knn_optimised.best_estimator_.get_params()
Out[155]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 3,
 'p': 2,
 'weights': 'uniform'}
In [156]:
 nb_optimised.best_estimator_.get_params()
Out[156]:
{'priors': None, 'var_smoothing': 0.0001}
In [158]:
cnn_model.summary()
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d_5 (Conv1D)           (None, 9, 32)             96        
                                                                 
 dropout_13 (Dropout)        (None, 9, 32)             0         
                                                                 
 flatten_5 (Flatten)         (None, 288)               0         
                                                                 
 dense_16 (Dense)            (None, 64)                18496     
                                                                 
 dropout_14 (Dropout)        (None, 64)                0         
                                                                 
 dense_17 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 18,657
Trainable params: 18,657
Non-trainable params: 0
_________________________________________________________________
In [150]:
def averaged_predictions(x, m1, m2, m3, weights):
    y_m1 = m1.predict(x)
    y_m1 = [weights[0] if y > 0.5 else 1 - weights[0] for y in y_m1]
    y_m2 = m2.predict(x)
    y_m2 = [weights[1] if y > 0.5 else 1 - weights[1] for y in y_m2]
    y_m3 = m3.predict(x)
    y_m3 = [weights[2] if y > 0.5 else 1 - weights[2] for y in y_m3]
    result = [x + y + z for x, y, z in zip(y_m1, y_m2, y_m3)]
    result = [1 if y > sum(weights)/2 else 0 for y in result]
    return result
y_pred = averaged_predictions(X_test_selected, knn_optimised, nb_optimised, cnn_model, (1, 0.7, 0.8))
test_recall = recall_score(y_test,y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
(test_recall, test_accuracy)
4/4 [==============================] - 0s 2ms/step
Out[150]:
(0.9782608695652174, 0.9649122807017544)